Skip to content

Add raw search fast path with code domain support#174

Open
michael-schvarcz wants to merge 4 commits into
XortexAI:mainfrom
michael-schvarcz:codex/raw-search-fast-path
Open

Add raw search fast path with code domain support#174
michael-schvarcz wants to merge 4 commits into
XortexAI:mainfrom
michael-schvarcz:codex/raw-search-fast-path

Conversation

@michael-schvarcz
Copy link
Copy Markdown

Addresses #163.

This adds a low-latency raw search path that returns ranked memory hits without the agentic LLM planning/synthesis round, while keeping answer generation optional through answer=true.

What changed:

  • Adds RetrievalPipeline.raw_search for direct profile, temporal, summary, and snippet retrieval without tool selection.
  • Extends /v1/memory/search to optionally include code hits by querying the existing code retrieval pipeline's native symbol/file search tools.
  • Adds answer=true synthesis after raw hits are collected, without a second planning step.
  • Caches profile catalogs and agentic retrieval plans.
  • Tracks bounded p50/p95/p99 latency snapshots by retrieval mode.
  • Adds tests proving raw search bypasses LLM tool planning, answer mode only synthesizes after hits, and cached plans are reused.

Compared with the currently open #173, this implementation also wires the requested code domain into the raw search endpoint.

Validation:

.venv/bin/python -m pytest tests/integration/test_retrieval_pipeline.py tests/api/test_dependencies_and_routes.py -q
9 passed

git diff --check
passed

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This PR expands the memory search capabilities by adding a "code" domain for repository-level searches and introducing a fast raw_search path that avoids LLM planning. Key architectural additions include a latency tracking system for performance monitoring and caching mechanisms for retrieval plans and profile catalogs. The review feedback highlights several critical improvements: addressing a potential TypeError when rounding null scores, implementing bounded caches to prevent memory leaks, maintaining proper encapsulation of private methods, and enhancing the robustness of concurrent tool execution through better error handling.

Comment thread src/api/routes/memory.py Outdated
SourceRecord(
domain=s.domain,
content=s.content,
score=round(s.score, 3),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The score attribute of a SourceRecord can be None (as handled in the sorting logic on line 728). Attempting to call round(None, 3) will raise a TypeError. It is safer to provide a default value of 0.0 if the score is missing.

Suggested change
score=round(s.score, 3),
score=round(s.score, 3) if s.score is not None else 0.0,

Comment thread src/pipelines/retrieval.py Outdated
Comment on lines +172 to +175
self._profile_catalog_cache: Dict[
str, Tuple[float, List[Dict[str, str]], List[Any]]
] = {}
self._retrieval_plan_cache: Dict[str, List[Dict[str, Any]]] = {}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The _profile_catalog_cache and _retrieval_plan_cache are implemented as unbounded dictionaries. In a long-running API process, these will grow indefinitely as different users and queries are processed, potentially leading to memory exhaustion. It is recommended to use a cache with a maximum size or an eviction policy (e.g., cachetools.TTLCache or a simple LRU implementation).

Comment thread src/api/routes/memory.py Outdated
data = SearchResponse(results=all_results, total=len(all_results))
answer = ""
if req.answer:
answer = await pipeline._answer_from_sources(query=req.query, sources=records)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This line accesses _answer_from_sources, which is a private method of the RetrievalPipeline class. To maintain proper encapsulation, this method should be made public (e.g., renamed to answer_from_sources) if it is intended to be used outside the class.

Comment thread src/pipelines/retrieval.py Outdated
if include_answer:
answer = await self._answer_from_sources(query=query, sources=sources)

confidence = min(1.0, len(sources) * 0.15) if sources else 0.1
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The confidence calculation logic here (len(sources) * 0.15) is inconsistent with the logic used in the agentic run method on line 308 (len(sources) * 0.2). Consider unifying this logic or defining the multiplier as a constant to ensure consistent confidence scoring across different retrieval modes.

Comment thread src/api/routes/memory.py
Comment on lines +709 to +724
code_results = await asyncio.gather(
code_pipeline._execute_tool(
tool_name="search_symbols",
tool_args={"query": req.query, "repo": req.repo},
repo=req.repo,
top_k=req.top_k,
user_id=user_id,
),
code_pipeline._execute_tool(
tool_name="search_files",
tool_args={"query": req.query, "repo": req.repo},
repo=req.repo,
top_k=req.top_k,
user_id=user_id,
),
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If any of the code retrieval tool calls fail, the entire search request will return a 500 error. For a more robust search experience, consider using asyncio.gather(..., return_exceptions=True) and filtering out failed results, allowing the user to see memory hits even if the code domain search is temporarily unavailable.

@michael-schvarcz michael-schvarcz force-pushed the codex/raw-search-fast-path branch from 8aa3426 to bb08f07 Compare May 11, 2026 18:43
@michael-schvarcz
Copy link
Copy Markdown
Author

Updated this PR to address the review points:

  • Normalizes missing source scores through a shared helper before API serialization.
  • Keeps raw memory search resilient when one requested backend fails by using asyncio.gather(..., return_exceptions=True) and returning healthy domain results.
  • The current head already uses bounded LRU-style caches, public answer_from_sources / confidence_from_sources helpers, and the shared confidence multiplier.

Validation: .venv/bin/python -m pytest tests/integration/test_retrieval_pipeline.py tests/api/test_dependencies_and_routes.py -> 9 passed.

Comment thread src/pipelines/retrieval.py Outdated
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rechecked the Gemini items and they look addressed. One remaining edge case I noticed is in src/pipelines/retrieval.py::_format_tool_results: API serialization now normalizes None scores, but answer=true can still fail before serialization because answer_from_sources() formats sources via _format_tool_results(), which does rec.score > 0 / {rec.score:.2f}.

If any backend returns SourceRecord(score=None), raw search without answer works, but raw search with answer=true can raise there. Can we normalize the score inside _format_tool_results() too, e.g. score = rec.score or 0.0, before formatting?

After that, this looks good to me.

Copy link
Copy Markdown
Collaborator

@Ankit-Kotnala Ankit-Kotnala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks good and the Gemini review items appear addressed.

I left one blocking edge-case comment around _format_tool_results(): answer=true can still fail if any backend returns SourceRecord(score=None), because the answer formatting path uses rec.score > 0 / {rec.score:.2f} before API serialization normalizes scores.

Once that score normalization is handled in _format_tool_results(), I’m happy with this PR.

@michael-schvarcz
Copy link
Copy Markdown
Author

Fixed the remaining missing-score edge case in _format_tool_results() by normalizing None to 0.0 before score comparison/formatting, so answer=true no longer crashes before API serialization.

Added regression coverage for answer_from_sources() with SourceRecord(score=None).

Validation:

.venv/bin/python -m pytest tests/integration/test_retrieval_pipeline.py tests/api/test_dependencies_and_routes.py -q
# 10 passed
git diff --check

@nagiexplorer88
Copy link
Copy Markdown

One score-hardening gap still remains after 89fe3e4. _score_value() only maps None to 0.0:

def _score_value(score: float | None) -> float:
    return round(score, 3) if score is not None else 0.0

If any vector/code backend returns a non-finite float (nan or inf), round(score, 3) preserves that non-finite value. The /v1/memory/search path then places it into SearchResponse.results[*].score, and _wrap() passes body.model_dump() into JSONResponse. Starlette JSON responses reject non-finite floats, so raw search can still fail at response rendering even though the missing-score case is covered.

This also affects retrieve_memory() because it uses the same helper for result.sources. A small float(...) + math.isfinite(...) coercion before rounding would close both paths.

Copy link
Copy Markdown
Collaborator

@Ankit-Kotnala Ankit-Kotnala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this addresses my concern. _format_tool_results() now safely handles missing source scores before formatting, and the added regression test covers answer_from_sources() with score=None.

Looks good to merge from my side.

@ishaanxgupta you can merge this if there aren't any concerns on your side.

@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented May 23, 2026

Greptile Summary

This PR adds a low-latency raw_search fast path to RetrievalPipeline that retrieves ranked memory hits across profile, temporal, summary, and snippet domains without an LLM planning step, and extends POST /v1/memory/search with optional code-domain support (via the existing CodeRetrievalPipeline) and optional answer synthesis (answer=true).

  • RetrievalPipeline.raw_search: gathers results from all requested domains in parallel, ranks by score, and optionally calls answer_from_sources — bypassing the two-step agentic plan entirely.
  • Caching: _profile_catalog_cache (LRU + 5 min TTL) avoids repeated Pinecone metadata scans; _retrieval_plan_cache (LRU, no TTL) reuses LLM tool-selection plans for identical query+catalog pairs.
  • search_memory route rewrite: merges memory and code results, applies _score_value normalization, records per-mode latency, and returns SearchResponse enriched with answer, confidence, and latency fields.

Confidence Score: 3/5

Safe to merge with low risk, but several small correctness issues should be addressed before the latency and sort paths are relied upon in production.

The core raw_search logic and code-domain wiring look correct and are well-tested. Four issues reduce confidence: the singleton pipeline's global latency snapshot is returned in every SearchResponse (leaking system-wide telemetry to any API caller); the retrieval plan cache has no TTL, so stale plans can persist across model upgrades as long as the catalog text doesn't change; _rank_sources and the route-level sort both pass NaN/Inf scores unchanged to Python's sort (which produces undefined ordering for NaN keys); and the strip_search_query validator runs after the min_length constraint, allowing a whitespace-only query to arrive at vector stores as an empty string.

src/pipelines/retrieval.py (_rank_sources, _cache_retrieval_plan) and src/api/routes/memory.py (sort call, latency exposure) deserve a second look before this ships to production.

Important Files Changed

Filename Overview
src/pipelines/retrieval.py Adds raw_search fast path, profile catalog cache (5 min TTL), LRU retrieval plan cache (no TTL), latency tracker, and helpers (confidence_from_sources, answer_from_sources, _rank_sources). Plan cache lacks TTL and _rank_sources doesn't sanitize non-finite scores.
src/api/routes/memory.py search_memory rewritten to use raw_search + optional code domain search + optional answer synthesis. Adds _score_value helper. Exposes singleton pipeline's global latency snapshot in every SearchResponse.
src/api/schemas.py SearchRequest gains answer, org_id, repo fields and a strip_search_query validator; SearchResponse gains answer, model, confidence, latency fields. Validator can produce empty string after stripping whitespace-only input, bypassing min_length=1.
tests/integration/test_retrieval_pipeline.py Adds four new integration tests covering raw_search ranking, answer synthesis, non-finite score handling, and plan cache reuse. Tests are well-structured and cover the key new paths.

Sequence Diagram

sequenceDiagram
    participant Client
    participant SearchRoute as POST /v1/memory/search
    participant MemPipeline as RetrievalPipeline
    participant CodePipeline as CodeRetrievalPipeline
    participant LLM

    Client->>SearchRoute: "{query, domains, top_k, answer, org_id, repo}"
    alt "code" in domains AND no org_id
        SearchRoute-->>Client: 400 org_id required
    end

    SearchRoute->>MemPipeline: "raw_search(domains-code, include_answer=False)"
    Note over MemPipeline: profile from cache (5 min TTL)<br/>temporal, summary, snippet in parallel
    MemPipeline-->>SearchRoute: RetrievalResult.sources

    opt "code" in domains
        SearchRoute->>CodePipeline: _execute_tool(search_symbols) [parallel]
        SearchRoute->>CodePipeline: _execute_tool(search_files) [parallel]
        CodePipeline-->>SearchRoute: code SourceRecords
    end

    SearchRoute->>SearchRoute: sort all records by score desc

    opt "req.answer == True"
        SearchRoute->>MemPipeline: answer_from_sources(query, records)
        MemPipeline->>LLM: ainvoke(ANSWER_PROMPT)
        LLM-->>MemPipeline: answer text
        MemPipeline-->>SearchRoute: answer
    end

    SearchRoute-->>Client: "SearchResponse{results, total, answer, model, confidence, latency}"
Loading

Fix All in Cursor Fix All in Codex Fix All in Claude Code

Reviews (1): Last reviewed commit: "Normalize non-finite source scores" | Re-trigger Greptile

@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented May 23, 2026

Comments Outside Diff (5)

  1. src/api/routes/memory.py, line 760-776 (link)

    Cross-user latency telemetry in API response. pipeline is a singleton shared across all users, so latency_snapshot() returns aggregate p50/p95/p99 statistics accumulated from every request by every user. Including this in the per-request SearchResponse leaks system-wide load patterns to any API caller — e.g., a client could fingerprint how busy the service is based on changing percentiles.

  2. src/pipelines/retrieval.py, line 413-421 (link)

    Retrieval plan cache has no TTL. Unlike _profile_catalog_cache (300 s expiry), entries in _retrieval_plan_cache survive until LRU eviction. If the underlying model is swapped or tool schemas change between deployments, repeated identical queries (same user + query + catalog text) will keep executing the old tool plan indefinitely without ever re-asking the LLM.

  3. src/pipelines/retrieval.py, line 434-439 (link)

    Non-finite scores pass through the sort key unchanged. float('nan') is truthy in Python, so source.score if source.score is not None else 0.0 keeps nan intact. Python's TimSort produces implementation-defined (and potentially wrong) ordering when keys contain nan. float('inf') from code search results would always sort first. The _finite_score helper already exists for exactly this case.

  4. src/api/routes/memory.py, line 745 (link)

    The route-level sort uses s.score or 0.0, which does not sanitize float('nan') — nan is truthy, so nan or 0.0 returns nan. This is the same ordering hazard as in _rank_sources, and only affects code-search results since memory results are already sorted by _rank_sources before being merged here.

  5. src/api/schemas.py, line 191-194 (link)

    The strip_search_query validator runs after Pydantic's built-in constraint checks (including min_length=1), so a query of " " (one space) passes the length check and is then reduced to "". An empty string then reaches vector store embedding calls, which may error or return semantically useless results. The fix is to strip before the length constraint fires.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants